Student name: Cynthia Pedrasa
Student pace: self paced
Scheduled project review date/time: Wed, Dec 18, 2019 7:00pm - 7:30pm (EDT)
Instructor name: Eli Thomas
Blog post URL: https://cpedrasa.github.io/racial_bias_in_machine_learning_algorithm
Problem Description - The question we are trying to answer as a consultant in the Real-Estate Investment Firm is as follows:
"What are the top 5 zip codes nationwide with the highest Return on Investment (ROI) in 5 years
that are within accepted standard deviation threshold (low volatility or variability in returns)?"
The dataset provides the monthly average house sale price nationwide
from April 1996 to April 2018, i.e., 22 years of data.
The values are average prices and there are 1,329 observations. The data set is credited to Zillow.
Understanding the limitations of our data and what potential questions can be answered by data is important and will help define the scope or goal for this project.
Data: we have the monthly average house price for each zip code (RegionName) on each date.
Project outline:
I. Data Preparation
II. Feature Engineering
III. Time Series Structure
IV. Evaluate Models
V. Forecast Models
#import packages
import warnings
warnings.filterwarnings('ignore') #ignore the warnings to avoid a lot of noise from running the procedure
import pandas as pd
from pandas import Series
from pandas import Grouper
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import set_matplotlib_formats
plt.style.use('ggplot')
import seaborn as sns
import folium
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
import scipy
import statsmodels
import math
from statsmodels.tsa.stattools import adfuller #Augmented Dickey Fuller Test
import pmdarima as pm
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.seasonal import seasonal_decompose
# Load the zillow_data.csv data set
rdf = pd.read_csv('zillow_data.csv')
#Look at the data
display(rdf.head())
#Look at the shape of data
display(rdf.shape)
# shape returns (# of rows, # of columns)
display(rdf.info())
#Let's check the data types of the first seven columns. See if any columns have nulls.
print(rdf.loc[:, 'RegionID':'SizeRank'].info())
#display(df.iloc[:7,:7].dtypes)
print('\n')
#Let's check the unique data types of the date columns (everything after SizeRank)
print('Date columns have data types of :', rdf.iloc[:, 7:].dtypes.unique())
# display descriptive statistics
display(rdf.describe(percentiles=[0.25,0.5,0.75,0.85,0.95,0.99]))
#Let's summarize
print('How many rows are in the dataset?' ,len(rdf))
print('How many columns are in this dataset?', len(rdf.columns))
print('What\'s the total # of entries?',rdf.size)
print('How many are null values?',rdf.isna().sum().sum())
print('What is the proportion of null values?', round((rdf.isna().sum().sum()/rdf.size) *100,1),"%")
#Let's check validity of zip codes
rdf["RegionName"]= rdf["RegionName"].astype(str)
booleans = []
for z in rdf.RegionName:
if len(z) == 5:
booleans.append(True)
else:
booleans.append(False)
booleans[0:5]
is_valid = pd.Series(booleans)
print('There are', len(rdf[~is_valid]), 'zip codes with length != 5. This comprises', round(len(rdf[~is_valid])/len(rdf)*100, 1), '%')
display(rdf[~is_valid].head())
After checking the validity of these 4-digit codes, it appears the leading zero was truncated; they are still valid zip codes.
#Let's make a copy of our database as we will look at some calculated fields for selecting top 5 zip codes
dfs=rdf.copy()
dfs.dropna(inplace=True)
dfs.shape
https://www.mashvisor.com/blog/how-to-calculate-real-estate-appreciation/
To get the current HousePriceIndex go to:
https://www.fhfa.gov/AboutUs/Reports/ReportDocuments/HPI_August2019.pdf
Step 1: - Look up the FHFA House Price Index for each State to calculate Future Growth for 5 years needed to calculate Future Value to estimate real estate appreciation
Step 2: Calculate Future Value and Appreciation. Example:
A comparative market analysis returned a fair market value of $150,000 on a potential investment property. A real estate investor plans to hold the rental property for 5 years before selling. Let's see what the future value of the property will be thanks to appreciation:
Future Growth = (1 + 0.034)^5
Future Growth = 1.18
Future Value = (1.18) x (150,000)
Future Value = $177,000
Step 3: Consider Market Volatility in making the final selection
In five years, the investment property will be worth approximately $177,000.
ROI Formula: ((sale price) - (purchase price)) divided by (purchase price)
ROI can be impacted by several factors, such as market volatility.
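The three steps above can be checked in a few lines of Python, using the numbers from the worked example (3.4% annual growth, a $150,000 purchase price, and a 5-year hold):

```python
# Worked example from above: 3.4% annual growth, $150,000 purchase, 5-year hold
annual_growth = 0.034
years = 5
purchase_price = 150_000

future_growth = (1 + annual_growth) ** years      # ~1.18
future_value = future_growth * purchase_price     # ~$177,000
appreciation = future_value - purchase_price
roi = appreciation / purchase_price * 100         # ROI as a percentage

print(f'Future Growth: {future_growth:.2f}')
print(f'Future Value: ${future_value:,.0f}')
print(f'ROI: {roi:.1f}%')
```

The small difference from the $177,000 quoted in the text comes from rounding Future Growth to 1.18 before multiplying.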
#Since our data is current only as of April 2018, let's estimate Future Growth
#by adding 1 to the annual HPI growth rate and raising it to the power of 5
#to estimate appreciation in 5 years
q1 = """SELECT *, (([2018-04] - [1996-04])/ [1996-04])*100 as Incr_rate ,
CASE WHEN State in ( 'HI', 'AK', 'WA', 'OR', 'CA')
THEN 1.211 --((1+0.039)**5)
WHEN State in ( 'MT', 'ID', 'WY', 'NV', 'UT', 'CO', 'AZ', 'NM')
THEN 1.370 --((1+0.065)**5)
WHEN State in ( 'ND', 'SD', 'MN', 'NE', 'IA', 'KS', 'MO')
THEN 1.258 --((1+0.047)**5)
WHEN State in ( 'OK', 'AR', 'TX', 'LA' )
THEN 1.258 --((1+0.047)**5)
WHEN State in ( 'MI', 'WI', 'IL', 'IN', 'OH')
THEN 1.295 --((1+0.053)**5)
WHEN State in ( 'KY', 'TN', 'MS', 'AL')
THEN 1.217 --((1+0.04)**5)
WHEN State in ( 'ME', 'NH', 'VT', 'MA', 'RI','CT')
THEN 1.276 --((1+0.05)**5)
WHEN State in ( 'NY', 'PA', 'NJ')
THEN 1.211 --((1+0.039)**5)
WHEN State in ( 'DE', 'MD', 'DC', 'VA', 'WV','NC', 'SC', 'GA', 'FL')
THEN 1.228 --((1+0.042)**5) ELSE 1
END as FutureGrowth
FROM dfs WHERE Incr_rate > 0
ORDER BY Incr_rate DESC;"""
df_hpi = pysqldf(q1)
# Calculate the estimated Future Value, Appreciation and ROI
q2 = """SELECT *,
(FutureGrowth * [2018-04]) as FutureValue,
((FutureGrowth * [2018-04]) - [2018-04]) as Appreciation,
(((FutureGrowth * [2018-04]) - [2018-04]) / [2018-04])*100 as ROI
FROM df_hpi Order by ROI DESC;"""
df2 = pysqldf(q2)
df2.head()
#https://www.wallstreetmojo.com/realized-volatility/
'''Realized volatility is the assessment of variation in returns for an investment product, obtained by analyzing its historical returns within a defined time period. The degree of uncertainty and/or potential financial loss/gain from investing in a firm may be measured using the variability/volatility of the entity's stock prices. In statistics, the most common measure of variability is the standard deviation, i.e., the variability of returns from the mean. It is an indicator of the actual price risk.'''
#Calculate historical mean value
df2['mean']=df2.loc[:,'1996-04':'2018-04'].mean(axis=1)
#Calculate std to assess variability in housing prices (volatility)
df2['std']=df2.loc[:,'1996-04':'2018-04'].std(axis=1)
#Calculate coefficient of variance
df2['cv']=df2['std']/df2['mean']
#Show calculated values
dftop10 = df2[['RegionName','City','State','std','mean','ROI','cv','Incr_rate']].head(10)
display(dftop10.head(10),dftop10[['std','cv']].describe())
q3= '''SELECT *
FROM df2
WHERE RegionName IN
(SELECT RegionName
FROM dftop10
WHERE cv < .4
ORDER BY ROI DESC, Incr_rate DESC, cv ASC)
ORDER BY ROI DESC, Incr_rate DESC, cv ASC;'''
ds = pysqldf(q3).head(5)
ds.head()
These zip codes are 80211, 80238, 80203, 80220, and 80010.
#Let's remove all the calculated columns
ds = ds.loc[:, 'RegionID':'2018-04']
ds
#ds.drop(['std', 'mean', 'cv','ROI'], axis=1 ,inplace=True)
dfm = pd.melt(ds, id_vars=['RegionID','RegionName', 'City', 'State', 'Metro', 'CountyName','SizeRank'], var_name="Date", value_name="Value")
dfm['Date'] = pd.to_datetime(dfm['Date'], infer_datetime_format=True)
dfm = dfm.dropna(subset=['Value'])
dfm.info()
dfm.head()
#dfm.to_csv('zillowmelt_datacopy.csv', index=False)
#Let's remove the other attributes except for Date, Values, and Zip code
dfm.drop(['RegionID', 'City', 'State', 'Metro', 'CountyName','SizeRank'], axis=1, inplace=True)
display(dfm.head(),dfm.info())
#Let's rename the columns, index by date, and print the datatype of dts plus its first entries as our first time-series exploratory step.
# Confirm that date values are used for indexing in the dts dataset
dfm = dfm.rename(columns = {'Date': 'ds', 'Value': 'ts'})
dts = dfm.set_index('ds')
print(type(dts))
print(dts.head(15))
dts.index
#Let's check the descriptive statistics for the top 5 zip codes
dts.groupby('RegionName').ts.agg(['count','sum','median','mean','var','std','min','max'])
display(dts.head(),dts.info())
display(dts.describe())
dts.RegionName.value_counts()
top5_zip = ['80211', '80238', '80203', '80220', '80010']
map1 = folium.Map([39.7313, -104.95102], zoom_start = 12.25)
# CircleMarker with radius
folium.CircleMarker(location = [39.766499, -105.020401],
radius = 50, popup = '80211').add_to(map1)
folium.CircleMarker(location = [39.760601, -104.903297],
radius = 50, popup = '80238').add_to(map1)
folium.CircleMarker(location = [39.7313, -104.981102],
radius = 50, popup = '80203').add_to(map1)
folium.CircleMarker(location = [39.731201, -104.912903],
radius = 50, popup = '80220').add_to(map1)
folium.CircleMarker(location = [39.679001, -104.963097],
radius = 50, popup = '80010').add_to(map1)
# save as html
#map1.save(" my_map2.html ")
figsize = (20, 18)
map1
plt.figure(figsize=(20,10))
for z in top5_zip:
plt.plot(dts[dts.RegionName ==z].ts,color='b',lw=2,ls='-.', label= z)
mean_ = dts.groupby('ds').aggregate({'ts':'mean'})
std_ = dts.groupby('ds').aggregate({'ts':'std'})
plt.plot(mean_, color= 'red',label= 'Mean')
plt.plot(std_, color= 'black',label= 'StdDev')
# Add title
plt.title('Time-series graph for Top 5 zip codes', fontsize=18)
# Add Labels
plt.xlabel('Date', fontsize=16)
plt.ylabel('Value', fontsize=16)
plt.legend()
# Show graph
plt.show()
#Plot the individual zip codes
fig, axs = plt.subplots(1,5,figsize=(20,4),sharey=True)
for ax, column, index in zip(axs,top5_zip, range(0,5)):
ax.plot(dts[dts.RegionName ==column].ts, color='b',alpha=0.8)
ax.set_title(column)
ax.set_ylabel('Value')
fig.suptitle('Time-series graph for Top Zipcodes', fontsize=14, fontweight='bold');
There does appear to be an overall increasing trend.
There appears to be some differences in the variance over time.
There may be some seasonality (i.e., cycles) in the data.
Not sure about outliers.
Most time-series models assume that the underlying time-series data is stationary. This assumption gives us some nice statistical properties that allows us to use various models for forecasting.
Stationarity is a statistical assumption that a time-series has: a constant mean, a constant variance, and an autocovariance that does not depend on time.
If we are using past data to predict future data, we should assume that the data will follow the same general trends and patterns as in the past. This general statement holds for most training data and modeling tasks.
#Perform the rolling statistics (rolling mean & rolling std dev) to check stationarity.
#Plot the moving average and the moving variance to see if they vary with time.
#This is a visual technique for checking the stationarity of the data.
from statsmodels.tsa.stattools import adfuller
def test_stationarity(df, ts):
    """
    Test stationarity using rolling statistics and the Dickey-Fuller test.
    """
    full = df  # use the dataframe passed in rather than the global
    for z in top5_zip:
        df = full[full.RegionName == z]
        # Determine rolling statistics
        rolmean = df[ts].rolling(window=12, center=False).mean()
        rolstd = df[ts].rolling(window=12, center=False).std()
        # Plot rolling statistics:
        plt.plot(df[ts], color='blue', label='Original')
        plt.plot(rolmean, color='red', label='Rolling Mean')
        plt.plot(rolstd, color='black', label='Rolling Std')
        plt.legend(loc='best')
        plt.title('Rolling Mean & Standard Deviation for %s' % (ts))
        plt.xticks(rotation=45)
        plt.show(block=False)
        plt.close()
        # Perform the Augmented Dickey-Fuller (ADF) statistical test to check stationarity:
        # Null Hypothesis (H_0): time series is not stationary (p-value > .05, or ADF statistic > critical value)
        # Alternate Hypothesis (H_1): time series is stationary
        print('Zip: ', z, 'Results of Dickey-Fuller Test:')
        dftest = adfuller(df[ts], autolag='AIC')
        dfoutput = pd.Series(dftest[0:4],
                             index=['ADF Test Statistic',
                                    'p-value',
                                    '# Lags Used',
                                    'Number of Observations Used'])
        for key, value in dftest[4].items():
            dfoutput['Critical Value (%s)' % key] = value
        print(dfoutput)
#Inspect the plots for visual determination if ts is stationary or not
test_stationarity(df = dts, ts = 'ts')
Looking at our data:
When not to use time series analysis
A time series has a particular behavior over time, and there is a high probability that it will follow the same behavior in the future.
How do we make a series stationary? It must meet three conditions: constant mean, constant variance, and an autocovariance that does not depend on time.
ADF (Augmented Dickey-Fuller) Test - the null hypothesis is that the time series is non-stationary. The test results comprise a test statistic and critical values for different confidence levels. If the test statistic is less than the critical value, we can reject the null hypothesis and say that the series is stationary.
ARIMA = AR (auto-regressive) + I (integrated/differencing) + MA (moving average), where:
d = order of differencing
p = number of autoregressive lags
q = number of moving-average lags
To estimate p, use the PACF plot; to estimate q, use the ACF plot.
Two common reasons behind non-stationarity are trend and seasonality.
There are ways to correct for trend and seasonality, to make the time series stationary.
What happens if you do not correct for these things?
# Transformation - log ts
dts['ts_log'] = dts['ts'].apply(lambda x: np.log(x))
# Transformation - 12-month moving averages ts
dts['ts_moving_avg'] = dts['ts'].rolling(window = 12, center = False).mean()
# Transformation - 12-month moving averages of log ts
dts['ts_log_moving_avg'] = dts['ts_log'].rolling(window = 12, center = False).mean()
# Transformation - Difference between ts and moving average ts
dts['ts_moving_avg_diff'] = dts['ts'] - dts['ts_moving_avg']
# Transformation - Difference between logged ts and logged moving average ts
dts['ts_log_moving_avg_diff'] = dts['ts_log'] - dts['ts_log_moving_avg']
# Transformation - first-order difference of the logged ts
dts['ts_log_diff'] = dts['ts_log'] - dts['ts_log'].shift()
# Drop rows with NaNs introduced by the rolling windows and differencing
# (dts_transform references dts, so the rows are dropped from dts as well)
dts_transform = dts
dts_transform.dropna(inplace=True)
# Display data
display(dts_transform.head())
print('Transformation - log ts')
test_stationarity(df = dts_transform, ts = 'ts_log')
print('\n')
print('Transformation - 12-month moving average of ts')
test_stationarity(df = dts_transform, ts = 'ts_moving_avg')
print('\n')
print('Transformation - 12-month moving average of log ts')
test_stationarity(df = dts_transform, ts = 'ts_log_moving_avg')
print('\n')
print('Transformation - first-order difference of log ts')
test_stationarity(df = dts_transform, ts = 'ts_log_diff')
print('\n')
print('Transformation - Difference between log ts and its moving average')
test_stationarity(df = dts_transform, ts = 'ts_log_moving_avg_diff')
#The other zip codes' time series show only weak stationarity.
#Only one of the zip codes has a p-value below the significance level: 0.027028 < .05
def plot_decomposition(df, ts, trend, seasonal, residual):
    full = df  # use the dataframe passed in rather than the global
    for z in top5_zip:
        df = full[full.RegionName == z]
        print('Zip: ', z, 'Decomposition Plots:')
        f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 5), sharex=True)
        ax1.plot(df[ts], label='Original')
        ax1.legend(loc='best')
        ax1.tick_params(axis='x', rotation=45)
        ax2.plot(df[trend], label='Trend')
        ax2.legend(loc='best')
        ax2.tick_params(axis='x', rotation=45)
        ax3.plot(df[seasonal], label='Seasonality')
        ax3.legend(loc='best')
        ax3.tick_params(axis='x', rotation=45)
        ax4.plot(df[residual], label='Residuals')
        ax4.legend(loc='best')
        ax4.tick_params(axis='x', rotation=45)
        plt.tight_layout()
        # Show graph
        plt.suptitle('Trend, Seasonal, and Residual Decomposition of %s' % (ts),
                     x=0.5, y=1.05, fontsize=18)
        plt.show()
        plt.close()
    return
#Plot the moving average and the moving variance to visualize stationarity of the transformed data
def transformed_stationarity(df, ts):
    """
    Test stationarity using moving average statistics and Dickey-Fuller test
    Source: https://www.analyticsvidhya.com/blog/2016/02/time-series-forecasting-codes-python/
    """
    full = df  # use the dataframe passed in rather than the global
    for z in top5_zip:
        df = full[full.RegionName == z]
        # Determine rolling statistics
        rolmean = df[ts].rolling(window=12, center=False).mean()
        rolstd = df[ts].rolling(window=12, center=False).std()
        # Plot rolling statistics:
        plt.plot(df[ts], color='blue', label='Original')
        plt.plot(rolmean, color='red', label='Rolling Mean')
        plt.plot(rolstd, color='black', label='Rolling Std')
        plt.legend(loc='best')
        plt.title('Rolling Mean & Standard Deviation for %s' % (ts))
        plt.xticks(rotation=45)
        plt.show(block=False)
        plt.close()
        # Perform the Augmented Dickey-Fuller (ADF) statistical test to check stationarity:
        # Null Hypothesis (H_0): time series is not stationary (p-value > .05, or ADF statistic > critical value)
        # Alternate Hypothesis (H_1): time series is stationary
        print('Zip: ', z, 'Results of Dickey-Fuller Test:')
        dftest = adfuller(df[ts], autolag='AIC')
        dfoutput = pd.Series(dftest[0:4],
                             index=['ADF Test Statistic',
                                    'p-value',
                                    '# Lags Used',
                                    'Number of Observations Used'])
        for key, value in dftest[4].items():
            dfoutput['Critical Value (%s)' % key] = value
        print(dfoutput)
from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose each zip code's series separately so trend/seasonality
# are not estimated across the boundaries between different series
for z in top5_zip:
    mask = dts_transform.RegionName == z
    decomposition = seasonal_decompose(dts_transform.loc[mask, 'ts_log'], freq = 12)
    dts_transform.loc[mask, 'trend'] = decomposition.trend
    dts_transform.loc[mask, 'seasonal'] = decomposition.seasonal
    dts_transform.loc[mask, 'residual'] = decomposition.resid
plot_decomposition(df = dts_transform,
ts = 'ts_log',
trend = 'trend',
seasonal = 'seasonal',
residual = 'residual');
dts_transform = dts_transform.dropna()
transformed_stationarity(df = dts_transform, ts = 'residual')
There are both a trend and seasonality in our time series, and the residuals are NOT stationary. Let's perform some differencing, since the series is non-stationary, and then visualize the auto-correlation plots.
#https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/
# check ACF and PACF
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from matplotlib.pylab import rcParams
plt.rcParams.update({'figure.figsize':(20,12), 'figure.dpi':120})
for z in top5_zip:
dts_corr = dts_transform[dts_transform.RegionName == z]
# Original Series
fig, axes = plt.subplots(3, 2)
axes[0, 0].plot(dts_corr.ts_log); axes[0, 0].set_title('Original Series',fontsize=12)
plot_acf(dts_corr.ts_log, ax=axes[0, 1], color='indigo',alpha=0.4)
# 1st Differencing
axes[1, 0].plot(dts_corr.ts_log.diff()); axes[1, 0].set_title('1st Order Differencing',fontsize=10)
plot_acf(dts_corr.ts_log.diff().dropna(), ax=axes[1, 1], color='indigo',alpha=0.4)
# 2nd Differencing
axes[2, 0].plot(dts_corr.ts_log.diff().diff()); axes[2, 0].set_title('2nd Order Differencing',fontsize=12)
plot_acf(dts_corr.ts_log.diff().diff().dropna(), color='indigo',alpha=0.4, ax=axes[2, 1])
fig.suptitle('Autocorrelation - ' + str(z), fontsize=16, fontweight='bold')
plt.subplots_adjust(hspace = 0.4 )
plt.show()
#The time series reaches stationarity with two orders of differencing.
# PACF plot of 1st differenced series
plt.rcParams.update({'figure.figsize':(20,12), 'figure.dpi':120})
for z in top5_zip:
dts_corr = dts_transform[dts_transform.RegionName == z]
fig, axes = plt.subplots(3, 2)
axes[0, 0].plot(dts_corr.ts_log); axes[0, 0].set_title('Original Series',fontsize=12)
plot_pacf(dts_corr.ts_log, ax=axes[0, 1], color='indigo',alpha=0.4)
# 1st Differencing
axes[1, 0].plot(dts_corr.ts_log.diff()); axes[1, 0].set_title('1st Order Differencing',fontsize=10)
plot_pacf(dts_corr.ts_log.diff().dropna(), ax=axes[1, 1], color='indigo',alpha=0.4)
# 2nd Differencing
axes[2, 0].plot(dts_corr.ts_log.diff().diff()); axes[2, 0].set_title('2nd Order Differencing',fontsize=12)
plot_pacf(dts_corr.ts_log.diff().diff().dropna(), color='indigo',alpha=0.4, ax=axes[2, 1])
fig.suptitle('Partial Autocorrelation - ' + str(z), fontsize=16, fontweight='bold')
plt.subplots_adjust(hspace = 0.4 )
plt.show()
Looking at our data:
ARIMA = Auto-Regressive Integrated Moving Average.
Assumptions. The time-series is stationary.
Depends on:
1. Number of AR (Auto-Regressive) terms (p).
2. Number of I (Integrated or Difference) terms (d).
3. Number of MA (Moving Average) terms (q).
How do we determine p, d, and q? For p and q, we can use ACF and PACF plots
Autocorrelation Function (ACF). Correlation between the time series with a lagged version of itself (e.g., correlation of Y(t) with Y(t-1)).
Partial Autocorrelation Function (PACF). Additional correlation explained by each successive lagged term.
How do we interpret ACF and PACF plots?
Looking at our data:
def plot_acf_pacf(df, ts):
    """
    Plot auto-correlation function (ACF) and partial auto-correlation (PACF) plots
    (lag_acf and lag_pacf are computed in the loop below before each call)
    """
    f, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
    # Plot ACF:
    ax1.plot(lag_acf)
    ax1.axhline(y=0, linestyle='--', color='gray')
    ax1.axhline(y=-1.96/np.sqrt(len(df[ts])), linestyle='--', color='gray')
    ax1.axhline(y=1.96/np.sqrt(len(df[ts])), linestyle='--', color='gray')
    ax1.set_title('Autocorrelation Function for %s' % (ts))
    # Plot PACF:
    ax2.plot(lag_pacf)
    ax2.axhline(y=0, linestyle='--', color='gray')
    ax2.axhline(y=-1.96/np.sqrt(len(df[ts])), linestyle='--', color='gray')
    ax2.axhline(y=1.96/np.sqrt(len(df[ts])), linestyle='--', color='gray')
    ax2.set_title('Partial Autocorrelation Function for %s' % (ts))
    plt.tight_layout()
    plt.show()
    plt.close()
    return
#ACF and PACF plots: Find value of P & Q
from statsmodels.tsa.stattools import acf, pacf
for z in top5_zip:
    df = dts_transform[dts_transform.RegionName == z]
    print('Zip:', z)
    # determine ACF and PACF for this zip code's differenced series
    lag_acf = acf(np.array(df['ts_log_diff']))
    lag_pacf = pacf(np.array(df['ts_log_diff']))
    # plot ACF and PACF
    plot_acf_pacf(df = df, ts = 'ts_log_diff')
#https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/
# compute residuals and create an ARMA model of the residuals.
import pmdarima as pm
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.seasonal import seasonal_decompose
residuals_list = []
for z in top5_zip:
df_z = dts_transform[dts_transform.RegionName == z]
ts = df_z.ts
decomposition = seasonal_decompose(np.log(ts))
# Gather the trend, seasonality and noise of decomposed object
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
df_z['residual'] = residual
residuals_list.append(residual)
result = adfuller(residual.dropna())
print('Zip: ', z)
print('ADF Statistic Residual: %f' % result[0])
print('p-value: %f' % result[1])
plt.show()
#Build Model
#Use a stepwise approach to search multiple combinations of p,d,q parameters and choose the best model (lowest AIC)
model = pm.auto_arima(df_z.ts.dropna(), start_p=1, start_q=1,
test='adf', # use adftest to find optimal 'd'
max_p=5, max_q=5, # maximum p and q
m=12, # frequency of series
d=2,
seasonal=False, # No Seasonality
start_P=0,
D=0,
trace=True,
error_action='ignore',
suppress_warnings=True,
stepwise=True)
print(z)
print(model.summary())
model.plot_diagnostics(figsize=(8, 6))
plt.show()
# Create Training and Test
#create the train and test datasets by splitting the time series into two contiguous parts, holding out the last 40 months (roughly an 85:15 split) for testing.
train = df_z.ts[:218]
test = df_z.ts[218:]
test_index = df_z.index[-40:]
model2 = ARIMA(train, order=model.order)
fitted = model2.fit(disp=-1)
# Forecast
fc, se, conf = fitted.forecast(40, alpha=0.05) # 95% conf
# Make as pandas series
fc_series = pd.Series(fc, index=test_index)
lower_series = pd.Series(conf[:, 0], index=test_index)
upper_series = pd.Series(conf[:, 1], index=test_index)
# Plot
plt.figure(figsize=(8,2), dpi=100)
plt.plot(train, label='training')
plt.plot(test, label='actual')
plt.plot(fc_series, color='darkgreen',label='forecast')
plt.fill_between(lower_series.index, lower_series, upper_series,
color='k', alpha=.15)
plt.title('Forecast vs Actuals')
plt.legend(loc='upper left', fontsize=8)
plt.show()
#Forecast - 5 years in the future
n_periods = 60
fc, confint = model.predict(n_periods=n_periods, return_conf_int=True)
index_of_fc = pd.date_range('2018-05-01', periods=60, freq='M')
# make series for plotting purpose
fc_series = pd.Series(fc, index=(index_of_fc))
lower_series = pd.Series(confint[:, 0], index=(index_of_fc))
upper_series = pd.Series(confint[:, 1], index=(index_of_fc))
# Plot
plt.figure(figsize=(8,2), dpi=100)
plt.plot(fc_series, color='darkgreen',label='forecast')
plt.plot(df_z.ts)
plt.fill_between(lower_series.index,
lower_series,
upper_series,
color='k', alpha=.15)
plt.title("Final Forecast " + str(z) )
plt.legend(loc='upper left', fontsize=8)
plt.show()
The Top 5 zip codes are all in the State of Colorado:
These are the zip codes with the greatest return on investment (ROI) and low volatility, as evidenced by low variation in historical prices.
Please note that ROI can be impacted by factors such as market volatility.